Updated extrapolation docs for clarity and adding an image #15325
base: master
Conversation
develop-docs/application-architecture/dynamic-sampling/extrapolation.mdx
Before:
Dynamic sampling reduces the amount of data ingested, for reasons of both performance and cost. When configured, a fraction of the data is ingested according to the specified sample rate of a project: if you sample at 10% and initially have 1000 requests to your site in a given timeframe, you will only see 100 spans in Sentry. Without making up for the sample rate, any metrics derived from these spans will misrepresent the true volume of the application. When different parts of the application have different sample rates, there will even be a bias towards some of them, skewing the total volume towards parts with higher sample rates. This bias especially impacts numerical attributes like latency, reducing their accuracy. To account for this fact, Sentry uses extrapolation to smartly combine the data to account for sample rates.

After:
[Dynamic sampling](/application-architecture/dynamic-sampling) reduces the amount of data ingested, to help with both performance and cost. When configured, a fraction of the data is ingested according to the specified sample rates within a project. For example, if you sample 10% of 1000 requests to your site in a given timeframe, you will see 100 spans in Sentry.
Suggested change:
Client and Server side sampling reduces the amount of data ingested, to help with both performance and cost. When configured, a fraction of the data is ingested according to the specified sample rates within a project. For example, if you sample 10% of 1000 requests to your site in a given timeframe, you will see 100 spans in Sentry.
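A note for readers of this hunk: the 10%-of-1000 example above is the core of how extrapolation recovers totals. Here is a minimal sketch of the idea, assuming simple inverse-sample-rate weighting; the function name and structure are illustrative, not Sentry's actual implementation:

```python
# Minimal sketch of inverse-sample-rate count extrapolation.
# Illustrative only -- not Sentry's actual implementation.

def extrapolated_count(sample_rates: list[float]) -> float:
    """Estimate the true event count from the sample rates attached
    to the events that were actually ingested."""
    # Each ingested event stands in for 1 / sample_rate real events.
    return sum(1.0 / rate for rate in sample_rates)

# 100 spans ingested at a 10% sample rate extrapolate back to 1000.
print(extrapolated_count([0.1] * 100))  # 1000.0
```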
Unchanged:
- **Accuracy** refers to data being correct. For example, the measured number of spans corresponds to the actual number of spans that were executed. As sample rates decrease, accuracy also goes down because minor random decisions can influence the result in major ways.

Before:
- **Expressiveness** refers to data being able to express something about the state of the observed system. Expressiveness refers to the usefulness of the data for the user in a specific use case.

After:
- **Usefulness** refers to data being able to express something about the state of the observed system, and the value of the data for the user in a specific use case. For example, a metric that shows the P90 latency of your application is useful for understanding the performance of your application, but a metric that shows the P90 latency of different endpoints in your application sampled at 10%, 1%, and 5% is not as useful because it is not a complete picture.
| - **Usefulness** refers to data being able to express something about the state of the observed system, and the value of the data for the user in a specific use case. For example, a metric that shows the P90 latency of your application is useful for understanding the performance of your application, but a metric that shows the P90 latency of different endpoints in your application sampled at 10%, 1%, and 5% is not as useful because it is not a complete picture. | |
| - **Usefulness** refers to data being able to express something about the state of the observed system, and the value of the data for the user in a specific use case. For example, a metric that shows the P90 latency of your application is useful for understanding the performance of your application, but a metric that shows the P90 latency of different endpoints in your application sampled at 10%, 1%, and 5% may not be as useful because it cannot represent the complete picture. |
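The mixed-sample-rate P90 example in this hunk is easier to see with numbers. Below is a hypothetical sketch of a sample-rate-weighted percentile, where each span counts 1/sample_rate times; this is illustrative and not necessarily how Sentry's storage platform computes it:

```python
# Hypothetical sketch: a sample-rate-weighted percentile, so spans
# from heavily sampled endpoints don't dominate the estimate.

def weighted_percentile(values, sample_rates, q):
    """Percentile where each value is weighted by 1 / sample_rate."""
    pairs = sorted(zip(values, sample_rates))
    total = sum(1.0 / r for _, r in pairs)
    cumulative = 0.0
    for value, rate in pairs:
        cumulative += 1.0 / rate
        if cumulative >= q * total:
            return value
    return pairs[-1][0]

# Latencies from endpoints sampled at 10% and 1%: the 1% span stands
# in for 100 real spans, so it carries 10x the weight of a 10% span.
latencies = [120, 130, 900]
rates = [0.1, 0.1, 0.01]
print(weighted_percentile(latencies, rates, 0.9))  # 900
```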
Before:
Depending on the context and the use case, one mode may be more useful than the other. Generally, default mode is useful for all queries that aggregate on a dataset of sufficient volume. As absolute sample size decreases below a certain limit, default mode becomes less and less expressive. There are scenarios where the user needs to temporarily switch between modes, for example, to examine the aggregate numbers first and dive into the number of samples for investigation. In both modes, the user may investigate single samples to dig deeper into the details.

After:
Depending on the context and the use case, one mode may be better suited than the other. Generally, default mode is useful for all queries that aggregate on a dataset of sufficient volume. As absolute sample size decreases below a certain limit, default mode becomes less and less useful. There are scenarios where you may need to temporarily switch between modes, for example, to examine the aggregate numbers first and dive into the number of samples for investigation. In both modes, you may investigate single samples to dig deeper into the details.
Suggested change:
Depending on the context and the use case, one mode may be better suited than the other. Generally, default mode is useful for all queries that aggregate on a dataset of sufficient volume. As absolute sample size decreases below a certain limit, default mode becomes less and less useful. There are scenarios where you may need to temporarily switch between modes, for example, to track usage and identify which endpoints or operations consume most spans
Before:
- **Default mode** extrapolates the ingested data as outlined below.
- **Sample mode** does not extrapolate and presents exactly the data that was ingested.

After:
- **Default mode** extrapolates the ingested data as outlined below - targeting usefulness.
- **Sample mode** does not extrapolate and presents exactly the data that was ingested - targeting accuracy, especially for small datasets.
Let's please not have a Samples Mode concept - this is just turning off extrapolation - so we can call it "Unextrapolated Mode" ?
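Regarding the default/sample ("Unextrapolated") mode distinction discussed in this thread: the difference boils down to whether the inverse-sample-rate weight is applied during aggregation. A hypothetical sketch for a count aggregate, not Sentry's actual query engine:

```python
# Hypothetical sketch of the two query modes: default mode weights
# each event by 1 / sample_rate; sample mode aggregates raw events.

def count(events: list[dict], extrapolate: bool = True) -> float:
    if extrapolate:  # default mode
        return sum(1.0 / e["sample_rate"] for e in events)
    return float(len(events))  # sample ("unextrapolated") mode

events = [{"sample_rate": 0.1}] * 100
print(count(events))                     # 1000.0 -- default mode
print(count(events, extrapolate=False))  # 100.0  -- sample mode
```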
Before:
1. **When both sample rate and event volume are low**: Extrapolation becomes less reliable in these cases. You can either increase your sample rate to improve accuracy, or switch to sample mode to examine the actual events - both are valid approaches depending on the user's needs.

After:
1. **When both sample rate and event volume are low**: Extrapolation becomes less reliable in these cases. You can either increase your sample rate to improve accuracy, or switch to sample mode to examine the actual events - both are valid approaches depending on your needs.
2. **When you have a high sample rate but still see low event volumes**: In this case, increasing the sample rate won't help capture more data, and sample mode will give you a clearer picture of the events you do have.
Would suggest rewriting this part - we never want ppl to disable extrapolation for anything other than debugging usage -
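To back up the reviewer's point that extrapolation should stay on outside of debugging: at low volume and low sample rate, each ingested event moves the extrapolated estimate by a full 1/sample_rate, so small random fluctuations dominate. A quick numeric illustration with assumed numbers:

```python
# Why low volume + low sample rate is unreliable: one event more or
# fewer shifts the extrapolated count by a full 1 / sample_rate.

rate = 0.01    # 1% sample rate (assumed for illustration)
ingested = 3   # only three events survived sampling

estimate = ingested / rate  # 300.0
swing = 1 / rate            # 100.0 per single sampled event
print(estimate, estimate - swing)  # 300.0 200.0 -- a one-event jump
```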
Unchanged:
### Confidence

Before:
When users filter on data that has a very low count but also a low sample rate, yielding a highly extrapolated but low-sample dataset, developers and users should be careful with the conclusions they draw from the data. The storage platform provides confidence intervals along with the extrapolated estimates for the different aggregation types to indicate when there is elevated uncertainty in the data. These types of datasets are inherently noisy and may contain misleading information. When this is discovered, the user should either be very careful with the conclusions they draw from the aggregate data or switch to non-default mode for investigation of the individual samples.

After:
When you filter on data that has a very low count but also a low sample rate, yielding a highly extrapolated but low-sample dataset, you should be careful with the conclusions you draw from the data. The storage platform provides confidence intervals along with the extrapolated estimates for the different aggregation types to indicate when there is lower confidence in the data. These types of datasets are inherently noisy and may contain misleading information. When this is discovered, you should either be very careful with the conclusions you draw from the aggregate data or switch to sample mode to investigate the individual samples.
We should include this image here: https://docs.sentry.io/product/explore/trace-explorer/#sampling-warnings
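On the confidence intervals mentioned above (and the sampling-warnings image the reviewer links to): one textbook way to derive such an interval for an extrapolated count, assuming independent Bernoulli sampling at a known rate, is a normal approximation. This is a hedged sketch, not Sentry's actual formula:

```python
# Rough confidence interval for an extrapolated count under
# Bernoulli(rate) sampling, via a normal approximation.
# A textbook sketch -- not Sentry's actual formula.
import math

def count_confidence_interval(n_ingested: int, rate: float, z: float = 1.96):
    estimate = n_ingested / rate
    # Plug-in variance of the inverse-rate (Horvitz-Thompson) count
    # estimator: n * (1 - rate) / rate**2.
    stddev = math.sqrt(n_ingested * (1 - rate)) / rate
    return estimate - z * stddev, estimate + z * stddev

# 5 events at a 1% rate: the point estimate (500) is highly uncertain.
print(count_confidence_interval(5, 0.01))  # roughly (64, 936)
```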
DESCRIBE YOUR PR
Updated the extrapolation docs for clarity and added an image to make it clearer how extrapolation works.
IS YOUR CHANGE URGENT?
Help us prioritize incoming PRs by letting us know when the change needs to go live.
SLA
Thanks in advance for your help!
PRE-MERGE CHECKLIST
Make sure you've checked the following before merging your changes: